Analyzing the Cause of Air Pollution in Korea
Objective
Air pollution, especially particulate matter, is increasing in severity each year.
This project aims to:
1. Analyze different factors that affect air pollution, especially particulate matter, in Korea
2. Analyze the claims that transportation affects particulate matter in Korea.
3. Apply EDA (Exploratory Data Analysis) functions to represent our analysis graphically.
Problem
What are the factors that affect particulate matter in Korea?
Observation
According to WHO, transportation is increasing particulate matter (PM) globally. WHO claims that road transport can contribute up to 50% of PM emissions in OECD countries. Note that Korea is a member of the OECD.
Source: https://www.who.int/sustainable-development/transport/health-risks/air-pollution/en/
According to the Korean Particulate Matter Information Center, transportation is a main source of PM emission in Korea as well. Rather than simply blaming China for their lack of effort in mitigating pollution, Seoul should reduce particulate matter through efficient TDM (Transportation Demand Management).
Source: https://bluesky.seoul.go.kr/news-list/major-news/page/22?article=358
Starting February 15, the new “Particulate Matter Mitigation Act” introduces restrictions on transportation to mitigate PM emission in Seoul. Drivers of “Level 5 Emission” vehicles will be fined. On certain days and in certain locations, only certain vehicles will be allowed to pass.
Source: https://bluesky.seoul.go.kr/finedust/emergency_reduction_measures
Hypothesis
Transportation is a main source of PM emission in Korea. If the new transportation law is implemented, it will lower the overall level of particulate matter in Seoul.
Data Wrangling
About the Dataset
Data Source: http://airemiss.nier.go.kr/common/downLoad.do?siteId=airemiss&fileSeq=411
This dataset provides the average air pollution emitted by different sectors in 2015. It gives insight into emissions by location, industry, and type of fuel used. It was chosen because datasets after 2015 were not available.
- The dataset is provided by the Korean Ministry of Environment.
- This dataset was used to analyze the effect of transportation in Korea.
Dataset
Raw dataset as downloaded from its source.
airPollution <- readxl::read_xlsx(path = "data.xlsx",
col_names = TRUE,
skip =3
)
airPollution %>% head() %>% kable %>% kable_styling(bootstrap_options = "striped", full_width = F)
| 시도 | 시군구 | 배출원대분류 | 배출원중분류 | 배출원소분류 | 연료대분류 | CO | NOx | SOx | TSP | PM10 | PM2.5 | VOC | NH3 | BC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 서울특별시 | 종로구 | 비산업 연소 | 상업 및 공공기관시설 | 기타 | B-A유 | 981 | 9797 | 40785 | 137 | 126 | 120 | 110 | 157 | 1 |
| 서울특별시 | 종로구 | 비산업 연소 | 상업 및 공공기관시설 | 기타 | B-B유 | 29 | 119 | 205 | 6 | 5 | 2 | 9 | 5 | NA |
| 서울특별시 | 종로구 | 비산업 연소 | 상업 및 공공기관시설 | 기타 | B-C유 | 6268 | 69363 | 301648 | 3048 | 2795 | 1436 | 2016 | 1003 | 14 |
| 서울특별시 | 종로구 | 비산업 연소 | 상업 및 공공기관시설 | 기타 | 경유 | 4860 | 19439 | 350 | 138 | 126 | 81 | 243 | 778 | 8 |
| 서울특별시 | 종로구 | 비산업 연소 | 상업 및 공공기관시설 | 기타 | 등유 | 633 | 2279 | 11 | 5 | 5 | 4 | 32 | 101 | NA |
| 서울특별시 | 종로구 | 비산업 연소 | 상업 및 공공기관시설 | 기타 | LPG | 1020 | 4948 | 22 | 16 | 16 | 16 | 268 | 58 | 6 |
Translating
All nonnumeric columns were changed to factors.
The summary function reveals that around one third of the pollution data is missing. Column PM2.5 has 17005 missing values out of 44763 rows, or 38%. Because the values vary so widely, as the summary below shows, imputing the missing values would produce inaccurate results. Thus, rows with missing values for PM2.5 are removed.
colnames(airPollution)[1:6] <- c("Area", "Municipality", "Sector", "Industry", "Type of Industry", "Fuel Type")
airPollution <- airPollution[c(1:6, 11:12)]
airPollution[sapply(airPollution, is.character)] <- lapply(airPollution[sapply(airPollution, is.character)],
as.factor
)
airPollution %>% summary()
 Area Municipality Sector
경기도 : 8196 동구 : 926 도로이동오염원 :12142
경상북도 : 4328 중구 : 896 제조업 연소 : 7461
경상남도 : 4324 서구 : 878 비산먼지 : 5333
전라남도 : 3963 남구 : 848 비도로이동오염원: 4420
서울특별시: 3527 북구 : 682 생물성 연소 : 3964
충청남도 : 3123 강서구 : 365 비산업 연소 : 3395
(Other) :17302 (Other):40168 (Other) : 8048
Industry Type of Industry Fuel Type
기타 : 6568 기타 : 3807 기타 :16595
truck : 2887 소형 : 2834 경유 :11103
sedan : 2790 중형 : 2363 LNG : 5075
건설장비: 2268 경형 : 1530 LPG : 4259
van : 1831 대형 : 1487 휘발유 : 3874
분뇨관리: 1775 가구 및 기타제품 제조업: 894 (Other): 3641
(Other) :26644 (Other) :31848 NA's : 216
PM10 PM2.5
Min. : 1 Min. : 1
1st Qu.: 21 1st Qu.: 17
Median : 190 Median : 147
Mean : 8286 Mean : 3560
3rd Qu.: 1852 3rd Qu.: 1189
Max. :24349414 Max. :12681987
NA's :16622 NA's :17005
Summary of PM10
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 21 190 8286 1852 24349414 16622
Summary of PM2.5
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 17 147 3560 1189 12681987 17005
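The missing-value share quoted above can be checked with a few lines of R. The sketch below is illustrative only: the `toy` data frame and its values are invented stand-ins for the emissions table, and only the counts 17005 and 44763 come from the summary output.

```r
# Toy stand-in for the emissions table; the values are invented for illustration.
toy <- data.frame(
  PM10  = c(126, 5, NA, 126, NA, 16),
  PM2.5 = c(120, NA, NA, 81, 4, 16)
)

# Share of missing values per column, as a percentage.
na_share <- round(colMeans(is.na(toy)) * 100, 1)
print(na_share)

# The figure quoted in the text: 17005 missing PM2.5 values out of 44763 rows.
pm25_missing_pct <- round(17005 / 44763 * 100, 1)
print(pm25_missing_pct)

# Keep only the rows where PM2.5 is present, mirroring the removal step.
toy_complete <- toy[!is.na(toy$PM2.5), ]
print(nrow(toy_complete))
```

Dropping rows instead of imputing keeps the heavy right tail of the distribution from leaking into filled-in values, at the cost of a smaller sample.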
Total Emission
According to the graph, 경상북도 (Gyeongsangbuk-do), 전라남도 (Jeollanam-do), 경기도 (Gyeonggi-do), and 충청남도 (Chungcheongnam-do) are the top emitters of particulate matter. In fact, Seoul (서울특별시) emits 8.2 times less PM2.5 than 경상북도. This is most likely due to the heavy industries in those regions. Korea could perhaps lower the overall PM level through stricter regulation in those provinces.
areaPM10 <- airPollution %>%
group_by(Area) %>%
summarise(TotalEmission = sum(PM10, na.rm = TRUE)) %>%
arrange(desc(TotalEmission))
barArea10 <- areaPM10 %>%
ggplot2::ggplot(mapping = aes(x = reorder(Area, -TotalEmission), y = TotalEmission))+
geom_bar(stat = "identity")+
theme(text= element_text(family = "AppleGothic"),
axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Total Emission of PM10 by Area")+
xlab("Area")
areaPM2.5 <- airPollution %>%
group_by(Area) %>%
summarise(TotalEmission = sum(PM2.5, na.rm = TRUE)) %>%
arrange(desc(TotalEmission))
barArea2.5 <- areaPM2.5 %>%
ggplot2::ggplot(mapping = aes(x = reorder(Area, -TotalEmission), y = TotalEmission))+
geom_bar(stat = "identity")+
theme(text= element_text(family = "AppleGothic"), axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Total Emission of PM2.5 by Area")+
xlab("Area")
table10 <- areaPM10 %>% head(10)
table2.5 <- areaPM2.5 %>% head(10)
kable(list(table10, table2.5), caption = "Top 10 PM Polluters (PM10:left, PM2.5:right)") %>%
kable_styling(position = "center")
Since the act affected mostly Seoul, the dataset was trimmed to include only Seoul (서울특별시).
Area Municipality Sector Industry
서울특별시:3527 강남구 : 160 도로이동오염원 :1172 기타 : 608
강원도 : 0 서초구 : 160 제조업 연소 : 609 sedan : 277
경기도 : 0 송파구 : 159 생물성 연소 : 348 truck : 265
경상남도 : 0 강서구 : 153 비산먼지 : 320 건설장비: 225
경상북도 : 0 영등포구: 150 유기용제 사용 : 293 van : 197
광주광역시: 0 금천구 : 146 비도로이동오염원: 286 road : 175
(Other) : 0 (Other) :2599 (Other) : 499 (Other) :1780
Type of Industry Fuel Type PM10 PM2.5
소형 : 296 기타 :1136 Min. : 1.0 Min. : 1.0
기타 : 280 경유 : 819 1st Qu.: 10.0 1st Qu.: 8.0
중형 : 250 LNG : 605 Median : 100.5 Median : 89.0
대형 : 154 휘발유 : 399 Mean : 4782.1 Mean : 1357.7
경형 : 152 LPG : 390 3rd Qu.: 1340.0 3rd Qu.: 930.2
특수 : 61 (Other): 171 Max. :628239.0 Max. :62824.0
(Other):2334 NA's : 7 NA's :1611 NA's :1627
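The report does not show the code that trimmed the data to Seoul. A minimal sketch of how such a filter might look is given below, using an invented `toy` data frame in place of `airPollution`. It also illustrates why the other areas still appear with a count of 0 in the summary above: subsetting a factor column keeps its unused levels.

```r
# Toy stand-in for airPollution; values are invented for illustration.
toy <- data.frame(
  Area  = factor(c("서울특별시", "경기도", "서울특별시", "강원도")),
  PM10  = c(126, 40, 5, 12),
  PM2.5 = c(120, 35, 2, 10)
)

# Keep only the Seoul rows.
seoul <- toy[toy$Area == "서울특별시", ]

# Unused factor levels survive the subset, so they show up with count 0.
print(table(seoul$Area))

# droplevels() removes the unused levels if they are not wanted.
seoul_clean <- droplevels(seoul)
print(levels(seoul_clean$Area))
```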
Transportation
To gauge the significance of transportation in PM emission in Seoul, the Industry column was divided into “transportation” and “non-transportation.”
# Divide the data into two groups, "transportation" and "non-transportation"
transportation <- airPollution[, c("Industry", "PM10","PM2.5" )]
However, the data contained 3238 missing values. Rows containing them were removed, leaving 1900 rows.
[1] 3238
According to the table, 34.7% of the industries in Seoul were categorized as “transportation”. Note that this is after removing rows with missing values; the “other” category was removed.
transportation <- transportation %>% na.omit()
# Label each row by whether its Industry value is a transportation category
transportation$type <- ifelse(transportation$Industry %in% c("bus", "road",
"offroad", "sedan", "van",
"bike", "taxi", "special", "truck"),
"transportation", "nontransportation")
transportation$type <- transportation$type %>% as.factor
transPercent <- transportation %>%
dplyr::group_by(type) %>%
dplyr::summarise(n = n()) %>%
plotly::plot_ly(labels = ~type, values = ~n, type = "pie") %>%
plotly::layout(title="% of Transportation")
transPercent
30% of PM2.5 and 27% of PM10 were emitted from transportation.
Although transportation's share is lower than the 50% cited by WHO, it remains a major contributor. Thus, if the Particulate Matter Mitigation Act is successful, it may yield significant results.
ft1 <- transportation %>%
dplyr::group_by(type) %>%
dplyr::summarise(PM2.5 = sum(PM2.5))
ft1$PM2.5 <- (ft1$PM2.5 /(ft1$PM2.5 %>% sum)) %>% round(digits = 2)
ft2 <- transportation %>%
dplyr::group_by(type) %>%
dplyr::summarise(PM10 = sum(PM10))
ft2$PM10 <- (ft2$PM10 /(ft2$PM10 %>% sum) )%>% round(digits = 2)
ft <- left_join(ft1, ft2, by = "type")
ftGraph1 <- ggplot(ft, aes(x = "", y = PM2.5, fill = type))+
geom_bar(width = 1, stat= "identity")+
coord_polar("y") +
theme_minimal()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank(),
plot.title=element_text(size=14, face="bold")
) +
theme(axis.text.x=element_blank()) +
geom_text(aes(y = PM2.5/2,
label = PM2.5
), size=5
)+
ggtitle("PM2.5 Emission")+
theme(plot.title = element_text(hjust = 0.5))
ftGraph2 <- ggplot(ft, aes(x = "", y = PM10, fill = type))+
geom_bar(width = 1, stat= "identity")+
coord_polar("y") +
theme_minimal()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank(),
plot.title=element_text(size=14, face="bold")
) +
theme(axis.text.x=element_blank()) +
geom_text(aes(y = PM10/2,
label = PM10
), size=5
)+
ggtitle("PM10 Emission")+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(ftGraph1, ftGraph2, nrow = 1)
2018
This dataset lists the pollution level for different sectors of Seoul for every day of 2018. It will be used to compare PM levels before and after the act.
air2018 <- readxl::read_xlsx(path = "ajswl2018.xlsx",
col_names = TRUE
)
air2018 %>% head() %>% kable
| 측정일시 | 측정소명 | 이산화질소농도(ppm) | 오존농도(ppm) | 이산화탄소농도(ppm) | 아황산가스(ppm) | 미세먼지(㎍/㎥) | 초미세먼지(㎍/㎥) |
|---|---|---|---|---|---|---|---|
| 20180101 | 강남구 | 0.033 | 0.010 | 0.6 | 0.006 | 34 | 22 |
| 20180101 | 강남대로 | 0.040 | 0.007 | 0.8 | 0.006 | NA | 17 |
| 20180101 | 강동구 | 0.038 | 0.010 | 0.7 | 0.005 | 48 | 24 |
| 20180101 | 강변북로 | 0.033 | 0.008 | 0.6 | 0.005 | 48 | 15 |
| 20180101 | 강북구 | 0.026 | 0.018 | 0.6 | 0.004 | 38 | 18 |
| 20180101 | 강서구 | 0.036 | 0.012 | 0.7 | 0.004 | NA | 13 |
Only the date, sector, PM10, and PM2.5 columns were kept. Since the 2019 dataset contains values only up to July 1st, the 2018 dataset was cut off at the same date. The date column was converted to a Date object, and sector was converted to a factor. Rows with missing values were removed rather than imputed, since air pollution can change drastically from day to day. The pollution levels of the 46 districts were averaged because this project aims to measure the overall PM level of Seoul as a whole.
air2018 <- air2018[c(1,2,7,8)]
colnames(air2018) <- c("date","sector", "PM10", "PM2.5")
air2018$sector <-air2018$sector %>% as.factor
air2018 <- air2018 %>% na.omit()
air2018 <- air2018 %>%
filter(date <= 20180701)
air2018$date <- air2018$date %>% as.character() %>% as.Date(format ="%Y%m%d")
After eliminating the missing values, 4923 rows remained, with at most 128 daily observations per sector.
date sector PM10 PM2.5
Min. :2018-01-01 강남구 : 128 Min. : 4.00 Min. : 1.00
1st Qu.:2018-02-01 강동구 : 128 1st Qu.: 34.00 1st Qu.: 17.00
Median :2018-03-07 강변북로: 128 Median : 47.00 Median : 25.00
Mean :2018-03-15 강북구 : 128 Mean : 49.71 Mean : 28.35
3rd Qu.:2018-04-27 공항대로: 128 3rd Qu.: 63.00 3rd Qu.: 37.00
Max. :2018-07-01 관악구 : 128 Max. :154.00 Max. :127.00
(Other) :4155
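Because rows with missing readings are dropped, the number of sectors reporting can differ from day to day. The sketch below uses invented sector names and values to illustrate why a per-day mean, as described above, is a steadier city-wide measure than a per-day sum under that condition. It is an illustration on toy data, not the report's own computation.

```r
# Toy daily readings; sector names and values are invented for illustration.
toy <- data.frame(
  date   = as.Date(c("2018-01-01", "2018-01-01", "2018-01-01",
                     "2018-01-02", "2018-01-02", "2018-01-02")),
  sector = c("A", "B", "C", "A", "B", "C"),
  PM10   = c(30, 40, 50, 30, NA, 50)
)

# Drop rows with missing readings, as in the report.
toy <- na.omit(toy)

# Per-day sum and per-day mean over the sectors that remain.
daily_sum  <- aggregate(PM10 ~ date, data = toy, FUN = sum)
daily_mean <- aggregate(PM10 ~ date, data = toy, FUN = mean)

print(daily_sum)   # falls on the second day only because one sector is missing
print(daily_mean)  # unchanged, since the remaining readings are the same
```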
# Average across sectors for each day to get a city-wide level
air2018PM10 <- air2018 %>%
group_by(date) %>%
summarise(DailyEmission = mean(PM10))
air2018PM2.5 <- air2018 %>%
group_by(date) %>%
summarise(DailyEmission = mean(PM2.5))
The first graph describes the daily change of PM10 and PM2.5 levels in Seoul in 2018. The daily series is noisy and hard to read.
The second graph describes the weekly average of PM levels in 2018.
air2018PM10day <- air2018PM10 %>%
ggplot(mapping = aes(x= date, y = DailyEmission))+
geom_line()+
ggtitle("Daily PM10 Level in Seoul in 2018")
air2018PM2.5day <- air2018PM2.5 %>%
ggplot(mapping = aes(x= date, y = DailyEmission))+
geom_line()+
ggtitle("Daily PM2.5 Level in Seoul in 2018")
grid.arrange(air2018PM10day, air2018PM2.5day, nrow = 1)
week <- format(air2018PM10$date, format = "%W")
air2018PM10 <- cbind(air2018PM10, week)
week <- format(air2018PM2.5$date, format = "%W")
air2018PM2.5 <- cbind(air2018PM2.5, week)
air2018PM10$week <- air2018PM10$week %>% as.numeric()
air2018PM2.5$week <- air2018PM2.5$week %>% as.numeric()
air2018PM10week <- air2018PM10 %>%
group_by(week) %>%
summarise(WeeklyEmission = (mean(DailyEmission))) %>%
ggplot(mapping = aes(x= week, y = WeeklyEmission))+
geom_line()+
ggtitle("Weekly PM10 Level in Seoul in 2018")
air2018PM2.5week <- air2018PM2.5 %>%
group_by(week) %>%
summarise(WeeklyEmission = (mean(DailyEmission))) %>%
ggplot(mapping = aes(x= week, y = WeeklyEmission))+
geom_line()+
ggtitle("Weekly PM2.5 Level in Seoul in 2018")
grid.arrange(air2018PM10week, air2018PM2.5week, nrow = 1)
2019
This dataset lists the pollution level for different sectors of Seoul from July 1st, 2018 to July 1st, 2019.
Observations: 15,508
Variables: 8
$ 측정일시 <dbl> 20190701, 20190701, 20190701, 20190701, 201907…
$ 측정소명 <chr> "강남구", "강남대로", "강동구", "강변북로", "강북구", "강서구", "공…
$ `이산화질소농도(ppm)` <dbl> 0.013, 0.047, 0.016, 0.051, 0.010, 0.018, 0.033, …
$ `오존농도(ppm)` <dbl> 0.045, 0.037, 0.045, 0.029, 0.049, 0.053, 0.03…
$ `일산화탄소농도(ppm)` <dbl> 0.5, 0.5, 0.5, 0.4, 0.5, 0.5, 0.5, 0.4, 0.5, 0.7,…
$ `아황산가스(ppm)` <dbl> 0.005, 0.004, 0.003, 0.004, 0.002, 0.004, 0.006…
$ `미세먼지(㎍/㎥)` <dbl> 35, 49, 37, 42, 40, 35, NA, 40, 33, 38, 31, 30, …
$ `초미세먼지(㎍/㎥)` <dbl> 25, 27, 29, 31, 26, 26, NA, 28, 22, 35, 27, 23, 2…
The dataset was trimmed so that the first date is January 1st, 2019. It has the same structure as the 2018 dataset, so the same operations were performed. The 2019 data had more missing values (839) than the 2018 data.
air2019 <- air2019[c(1,2,7,8)]
colnames(air2019) <- c("date","sector", "PM10", "PM2.5")
air2019 <- air2019 %>%
filter(date >= 20190101)
air2019$sector <-air2019$sector %>% as.factor
air2019 <- air2019 %>% na.omit()
air2019$date <- air2019$date %>% as.character() %>% as.Date(format ="%Y%m%d")
After eliminating rows with missing values, 8040 rows remained, with at most 175 daily observations per sector.
date sector PM10 PM2.5
Min. :2019-01-01 강남구 : 175 Min. : 3.00 Min. : 2.00
1st Qu.:2019-02-13 강남대로: 175 1st Qu.: 34.00 1st Qu.: 18.00
Median :2019-04-05 강동구 : 175 Median : 46.00 Median : 25.00
Mean :2019-04-02 강변북로: 175 Mean : 53.15 Mean : 31.07
3rd Qu.:2019-05-19 강북구 : 175 3rd Qu.: 65.00 3rd Qu.: 37.00
Max. :2019-07-01 강서구 : 175 Max. :240.00 Max. :155.00
(Other) :6990
# Average across sectors for each day to get a city-wide level
air2019PM10 <- air2019 %>%
group_by(date) %>%
summarise(DailyEmission = mean(PM10))
air2019PM2.5 <- air2019 %>%
group_by(date) %>%
summarise(DailyEmission = mean(PM2.5))
air2019PM10day <- air2019PM10 %>%
ggplot(mapping = aes(x= date, y = DailyEmission))+
geom_line()+
ggtitle("Daily PM10 Level in Seoul in 2019")
air2019PM2.5day <- air2019PM2.5 %>%
ggplot(mapping = aes(x= date, y = DailyEmission))+
geom_line()+
ggtitle("Daily PM2.5 Level in Seoul in 2019")
grid.arrange(air2019PM10day, air2019PM2.5day, nrow = 1)
week <- format(air2019PM10$date, format = "%W")
air2019PM10 <- cbind(air2019PM10, week)
week <- format(air2019PM2.5$date, format = "%W")
air2019PM2.5 <- cbind(air2019PM2.5, week)
air2019PM10$week <- air2019PM10$week %>% as.numeric()
air2019PM2.5$week <- air2019PM2.5$week %>% as.numeric()
air2019PM10week <- air2019PM10 %>%
group_by(week) %>%
summarise(WeeklyEmission = (mean(DailyEmission))) %>%
ggplot(mapping = aes(x= week, y = WeeklyEmission))+
geom_line()+
ggtitle("Weekly PM10 Level in Seoul in 2019")
air2019PM2.5week <- air2019PM2.5 %>%
group_by(week) %>%
summarise(WeeklyEmission = (mean(DailyEmission))) %>%
ggplot(mapping = aes(x= week, y = WeeklyEmission))+
geom_line()+
ggtitle("Weekly PM2.5 Level in Seoul in 2019")
grid.arrange(air2019PM10week, air2019PM2.5week, nrow = 1)
Likewise, the first graph describes the daily change of PM10 and PM2.5 levels in Seoul in 2019.
The second graph describes the weekly average of PM levels in 2019.
colnames(air2018)[3:4] <- c("PM10_2018", "PM2.5_2018")
colnames(air2019)[3:4] <- c("PM10_2019", "PM2.5_2019")
# Join on month-day and sector; joining on the full date would never match across years
air2018$day <- format(air2018$date, "%m-%d")
air2019$day <- format(air2019$date, "%m-%d")
total <- left_join(air2018, air2019[-1], by = c("day", "sector"))
Many factors affect the magnitude of air pollution; season is one of them. This project does not capture these other variables.
Ideally, air pollution levels would be compared before and after the act. However, such a comparison would be inaccurate because of the unavoidable seasonal variation.
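One possible way to soften the seasonal effect, sketched here on invented numbers, is to compare the same calendar week across the two years rather than the weeks immediately before and after the act. The `pm` data frame and its values are hypothetical; the week number is derived with the same `%W` format used earlier in the report. This is a sketch under those assumptions and does not control for weather or other year-to-year differences.

```r
# Toy city-wide daily PM10 levels for two calendar weeks in each year (invented).
pm <- data.frame(
  date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-08", "2018-01-09",
                   "2019-01-07", "2019-01-08", "2019-01-14", "2019-01-15")),
  PM10 = c(50, 60, 40, 44, 45, 55, 46, 48)
)

# Year and Monday-based week number for each observation.
pm$year <- format(pm$date, "%Y")
pm$week <- as.numeric(format(pm$date, "%W"))

# Weekly mean per year: rows are weeks, columns are years.
weekly <- tapply(pm$PM10, list(pm$week, pm$year), mean)
print(weekly)

# Year-over-year change for each matching week (2019 minus 2018).
change <- weekly[, "2019"] - weekly[, "2018"]
print(change)
```

A negative value means the level was lower in 2019 than in the same week of 2018; with real data, a run of such weeks after February 15 would be more suggestive than a single before-and-after comparison.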